Jesse Ekweozoh, Nistha Mitra and William Jonathan Faoro

INTRODUCTION

2020 has been historic in more ways than one. One of its most consequential events was the resurgence of the Black Lives Matter movement. The wrongful death of George Floyd while in police custody in Minneapolis served as a catalyst for a global uprising of the BLM movement. George Floyd's death became a face for the innumerable deaths of BIPOC at the hands of the police. Massive outcry on social media and in-person protests sparked the beginning of sweeping change, from police department budget cuts to resignations and the removal of monuments and statues.

The central objective of this project is to analyze data on fatal shootings by police, using data analysis tools and techniques, and to understand the disproportionate impact on Black, Indigenous and People of Color in the United States between 2015 and 2020. We aimed for our work to be a practical tutorial on data visualization and analysis, with an in-depth, step-by-step description of the code and the logic. Moreover, in the midst of ongoing discussions about racism, we hope to shed light on a situation that impacts all of us.

Required Tools

You will need the following libraries for this project:

  1. requests
  2. pandas
  3. numpy
  4. html5lib
  5. datetime
  6. Folium
  7. Seaborn
  8. matplotlib.pyplot

Read through the following resources for more information about pandas (including installation) and Python 3.6 in general:

  1. https://pandas.pydata.org/pandas-docs/stable/install.html
  2. https://docs.python.org/3/

DATA COLLECTION

Data collection is a systematic process of gathering observations or measurements. While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  1. The aim of the research
  2. The type of data that you will collect
  3. The methods and procedures you will use to collect, store, and process the data

Below we have the code to import all the necessary packages we will need in this project.

In [489]:
import sidetable #frequency tables via df.stb
import pandas as pd #pandas
import numpy as np #module
import seaborn as sns #to visualize
import matplotlib.pyplot as plt #for plotting
import requests
import warnings
warnings.filterwarnings("ignore")

Data Source and Description

The aim of our research has been described above. The main dataset supporting our thesis comes from The Washington Post, which released a dataset of fatal shootings by police in the US between 2015 and 2020. The dataset is available in this repository by the Washington Post: https://github.com/washingtonpost/data-police-shootings

Another dataset we use describes the results of the 2016 presidential election by county and state. We will use this to analyze the relation between political affiliation and police shootings. You can access the raw data from https://raw.githubusercontent.com/tonmcg/US_County_Level_Election_Results_08-20/master/2016_US_County_Level_Presidential_Results.csv

Lastly, we use a dataset of state names and abbreviations. We will use it to modify the original dataset and make it more readable. https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv

In [490]:
df = pd.read_csv("https://github.com/washingtonpost/data-police-shootings/releases/download/v0.1/fatal-police-shootings-data.csv")
In [491]:
df1 = pd.read_csv("https://raw.githubusercontent.com/tonmcg/US_County_Level_Election_Results_08-20/master/2016_US_County_Level_Presidential_Results.csv")
df3= pd.read_csv("https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv")

A DataFrame is a structure similar to a table or matrix, with rows and columns that contain certain data. Pandas allows us to easily perform a lot of manipulations on DataFrames through the use of their functions. You can find more info at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html.
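As a minimal illustration (with made-up values, not our data), a DataFrame can be built directly from a dictionary of columns:

```python
import pandas as pd

# A tiny, hypothetical DataFrame: two columns, two rows
example = pd.DataFrame({
    "state": ["WA", "OR"],
    "counts": [31, 45],
})

print(example.shape)            # (2, 2)
print(example["counts"].sum())  # 76
```

Each key becomes a column, and pandas operations like `sum()` act over a column's values.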

Data Set 1

You can see below the data describing each individual shooting and the information known about it. The df.head() call shows the top few rows, which helps us see the structure of the data and its columns.

In [492]:
print("Fatal shooting dataset has {}".format(df.shape[0]),
"rows and {}".format(df.shape[1]), "columns")
df.head(2)
Fatal shooting dataset has 5887 rows and 17 columns
Out[492]:
id name date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee body_camera longitude latitude is_geocoding_exact
0 3 Tim Elliot 2015-01-02 shot gun 53.0 M A Shelton WA True attack Not fleeing False -123.122 47.247 True
1 4 Lewis Lee Lembke 2015-01-02 shot gun 47.0 M W Aloha OR False attack Not fleeing False -122.892 45.487 True

Data Set 2

The following dataset gives us the results of the 2016 election. We don't need the entire dataset; it will later be cleaned according to our needs.

In [493]:
print("Election dataset has {}".format(df1.shape[0]),
"rows and {}".format(df1.shape[1]), "columns")
df1.head(57)
Election dataset has 3141 rows and 11 columns
Out[493]:
Unnamed: 0 votes_dem votes_gop total_votes per_dem per_gop diff per_point_diff state_abbr county_name combined_fips
0 0 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2013
1 1 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2016
2 2 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2020
3 3 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2050
4 4 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2060
5 5 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2068
6 6 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2070
7 7 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2090
8 8 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2100
9 9 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2105
10 10 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2110
11 11 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2122
12 12 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2130
13 13 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2150
14 14 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2164
15 15 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2170
16 16 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2180
17 17 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2185
18 18 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2188
19 19 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2195
20 20 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2198
21 21 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2220
22 22 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2230
23 23 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2240
24 24 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2261
25 25 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2270
26 26 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2275
27 27 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2282
28 28 93003.0 130413.0 246588.0 0.377159 0.528870 37,410 15.17% AK Alaska 2290
29 29 5908.0 18110.0 24661.0 0.239569 0.734358 12,202 49.48% AL Autauga County 1001
30 30 18409.0 72780.0 94090.0 0.195653 0.773515 54,371 57.79% AL Baldwin County 1003
31 31 4848.0 5431.0 10390.0 0.466603 0.522714 583 5.61% AL Barbour County 1005
32 32 1874.0 6733.0 8748.0 0.214220 0.769662 4,859 55.54% AL Bibb County 1007
33 33 2150.0 22808.0 25384.0 0.084699 0.898519 20,658 81.38% AL Blount County 1009
34 34 3530.0 1139.0 4701.0 0.750904 0.242289 2,391 50.86% AL Bullock County 1011
35 35 3716.0 4891.0 8685.0 0.427864 0.563155 1,175 13.53% AL Butler County 1013
36 36 13197.0 32803.0 47376.0 0.278559 0.692397 19,606 41.38% AL Calhoun County 1015
37 37 5763.0 7803.0 13778.0 0.418276 0.566338 2,040 14.81% AL Chambers County 1017
38 38 1524.0 8809.0 10503.0 0.145101 0.838713 7,285 69.36% AL Cherokee County 1019
39 39 2909.0 15068.0 18255.0 0.159354 0.825418 12,159 66.61% AL Chilton County 1021
40 40 3109.0 4102.0 7268.0 0.427766 0.564392 993 13.66% AL Choctaw County 1023
41 41 5712.0 7109.0 12936.0 0.441558 0.549552 1,397 10.80% AL Clarke County 1025
42 42 1234.0 5230.0 6572.0 0.187766 0.795800 3,996 60.80% AL Clay County 1027
43 43 684.0 5738.0 6532.0 0.104715 0.878445 5,054 77.37% AL Cleburne County 1029
44 44 4194.0 15825.0 20513.0 0.204456 0.771462 11,631 56.70% AL Coffee County 1031
45 45 7296.0 16718.0 24626.0 0.296272 0.678876 9,422 38.26% AL Colbert County 1033
46 46 3069.0 3413.0 6543.0 0.469051 0.521626 344 5.26% AL Conecuh County 1035
47 47 1780.0 3376.0 5223.0 0.340800 0.646372 1,596 30.56% AL Coosa County 1037
48 48 2379.0 13222.0 15818.0 0.150398 0.835883 10,843 68.55% AL Covington County 1039
49 49 1663.0 4511.0 6252.0 0.265995 0.721529 2,848 45.55% AL Crenshaw County 1041
50 50 3730.0 32734.0 37278.0 0.100059 0.878105 29,004 77.80% AL Cullman County 1043
51 51 4408.0 13798.0 18617.0 0.236773 0.741151 9,390 50.44% AL Dale County 1045
52 52 12826.0 5784.0 18730.0 0.684784 0.308809 7,042 37.60% AL Dallas County 1047
53 53 3682.0 21779.0 26086.0 0.141149 0.834892 18,097 69.37% AL DeKalb County 1049
54 54 8436.0 27619.0 36905.0 0.228587 0.748381 19,183 51.98% AL Elmore County 1051
55 55 4698.0 10282.0 15213.0 0.308815 0.675869 5,584 36.71% AL Escambia County 1053
56 56 10350.0 32132.0 43474.0 0.238073 0.739108 21,782 50.10% AL Etowah County 1055

Data Set 3

This dataset mainly adds convenience when looking at a plot. Abbreviations are compact, but in a plot with many variables it helps not to have to remember every code.

In [494]:
df3=df3.rename(columns = {'Abbreviation':'state', 'State':'state_name'})
df3.head(3)
Out[494]:
state_name state
0 Alabama AL
1 Alaska AK
2 Arizona AZ

DATA PROCESSING

After collection comes processing. Here we mean everything from data cleaning, data wrangling, and data formatting to data compression (for efficient storage) and data encryption (for secure storage).

The 'join' function joins columns with another DataFrame, either on the index or on a key column. You can efficiently join multiple DataFrame objects by index at once by passing a list.

Here, we join the first dataset to the third to add the state names.
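The note about passing a list can be sketched with toy frames (not our data) that share the same index:

```python
import pandas as pd

# Three toy DataFrames sharing the index ["x", "y"]
left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right1 = pd.DataFrame({"b": [10, 20]}, index=["x", "y"])
right2 = pd.DataFrame({"c": [100, 200]}, index=["x", "y"])

# Passing a list joins several DataFrames on the index in one call
combined = left.join([right1, right2])
print(combined.loc["x"].tolist())  # [1, 10, 100]
```

When joining a list, the column names must not overlap, since no suffixes are applied.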

In [495]:
df= df.join(df3.set_index('state'), on='state')
df.head(3)
Out[495]:
id name date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee body_camera longitude latitude is_geocoding_exact state_name
0 3 Tim Elliot 2015-01-02 shot gun 53.0 M A Shelton WA True attack Not fleeing False -123.122 47.247 True Washington
1 4 Lewis Lee Lembke 2015-01-02 shot gun 47.0 M W Aloha OR False attack Not fleeing False -122.892 45.487 True Oregon
2 5 John Paul Quintero 2015-01-03 shot and Tasered unarmed 23.0 M H Wichita KS False other Not fleeing False -97.281 37.695 True Kansas

The body camera column is not useful for our analysis. Moreover, as you see below, it mostly contains one value. We drop that column.

In [496]:
df.body_camera.value_counts()
Out[496]:
False    5154
True      733
Name: body_camera, dtype: int64

The names are too personal to use and might infringe on individual rights. The id column is redundant. Thus, we will drop these columns.

In [497]:
df.drop(['id','name', 'body_camera'], axis=1, inplace=True)
df.head(2)
Out[497]:
date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee longitude latitude is_geocoding_exact state_name
0 2015-01-02 shot gun 53.0 M A Shelton WA True attack Not fleeing -123.122 47.247 True Washington
1 2015-01-02 shot gun 47.0 M W Aloha OR False attack Not fleeing -122.892 45.487 True Oregon

Missing Values

“Armed”, “age”, “gender”, “race”, and “flee” columns have missing values.

In [498]:
df.isna().sum()
Out[498]:
date                         0
manner_of_death              0
armed                      212
age                        253
gender                       1
race                       546
city                         0
state                        0
signs_of_mental_illness      0
threat_level                 0
flee                       324
longitude                  288
latitude                   288
is_geocoding_exact           0
state_name                   0
dtype: int64

A more informative tool for inspecting missing values is the Seaborn heatmap.

In [499]:
plt.figure(figsize=(10,7))
sns.heatmap(df.isnull(), cbar = False, cmap = 'viridis')
Out[499]:
<matplotlib.axes._subplots.AxesSubplot at 0x20dff6cbd90>

The “flee” and “armed” columns describe the action of the person being shot.

In [500]:
df.flee.value_counts()
Out[500]:
Not fleeing    3645
Car             967
Foot            759
Other           192
Name: flee, dtype: int64
In [501]:
df.armed.value_counts()
Out[501]:
gun                    3369
knife                   861
unarmed                 374
toy weapon              195
undetermined            174
                       ... 
grenade                   1
hand torch                1
barstool                  1
bean-bag gun              1
car, knife and mace       1
Name: armed, Length: 95, dtype: int64

The action of “not fleeing” dominates the “flee” column, so we fill in its missing values with “Not fleeing”. Similarly, we fill missing “armed” values with the column's most common value. You can choose a different way to handle them, such as dropping those rows.

In [502]:
df.armed.fillna(df.armed.value_counts().index[0], inplace=True)
df.flee.fillna('Not fleeing', inplace=True)
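If you prefer dropping rows instead of imputing, a sketch on a toy frame (not our data) might look like this:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing 'flee' value
toy = pd.DataFrame({"flee": ["Not fleeing", np.nan, "Car"],
                    "age": [30, 41, 25]})

# Alternative to fillna: drop only the rows where 'flee' is missing
dropped = toy.dropna(subset=["flee"])
print(len(dropped))  # 2
```

The `subset` argument restricts the check to one column, so rows missing other fields survive.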

We drop any remaining rows with missing values, because the other columns describe the person being shot and it could be misleading to make assumptions without accurate information.

In [503]:
df.dropna(axis=0, how='any', inplace=True)
print("There are {}".format(df.isna().sum().sum()), "missing values left in the dataframe")
There are 0 missing values left in the dataframe

The following heatmap shows that there are no more missing data in the modified dataset.

In [504]:
plt.figure(figsize=(10,7))
sns.heatmap(df.isnull(), cbar = False, cmap = 'viridis')
Out[504]:
<matplotlib.axes._subplots.AxesSubplot at 0x20d8422f2e0>

The following copy is made for later use: the changes made to df after this point will not be needed for the location analysis, so we copy and store it now.

In [505]:
locationInfo = df.copy() 

Convert the date column to the 'datetime' type, which is the pandas data type for handling dates. After converting, extract the year from the date and create a new column. We will use it to see yearly shooting rates.

In [506]:
df['date'] = pd.to_datetime(df['date'])
df['year'] = pd.to_datetime(df['date']).dt.year

The following code groups the rows by year and state, so that each state's shooting count can be seen per year.

In [507]:
df.insert(2, 'Count_per_year', df.groupby(['year','state'])['year'].transform('size'))
In [508]:
df.head(2)
Out[508]:
date manner_of_death Count_per_year armed age gender race city state signs_of_mental_illness threat_level flee longitude latitude is_geocoding_exact state_name year
0 2015-01-02 shot 16 gun 53.0 M A Shelton WA True attack Not fleeing -123.122 47.247 True Washington 2015
1 2015-01-02 shot 14 gun 47.0 M W Aloha OR False attack Not fleeing -122.892 45.487 True Oregon 2015

We make a dataset with just the fatality counts per state over the five years. We group by state and add up all occurrences regardless of other attributes, giving us the total.

In [509]:
kill_st=df.groupby(['state']).size().reset_index(name='counts')
kill_st.head(3)
Out[509]:
state counts
0 AK 31
1 AL 90
2 AR 60

Next we clean df1 (the second dataset), which holds the election information. We drop the unnecessary columns and rename the key column to match the dataframe we will join it with. In this case, we will join it with kill_st, so we rename 'state_abbr' to 'state'.

In [510]:
df1.drop(['Unnamed: 0','total_votes','per_dem','per_gop','diff','per_point_diff','combined_fips','county_name'], axis=1, inplace=True)
df1=df1.rename(columns = {'state_abbr':'state'})

We join it with the third dataframe to add the state names, which gives us overall state votes.

In [511]:
df1= df1.join(df3.set_index('state'), on='state')
df1.head(3)
Out[511]:
votes_dem votes_gop state state_name
0 93003.0 130413.0 AK Alaska
1 93003.0 130413.0 AK Alaska
2 93003.0 130413.0 AK Alaska

The following code sums each county's votes by state to give total votes for Democrats and Republicans. We then create another column that compares the two vote totals per state and declares the state Red or Blue (Republican or Democratic).

In [512]:
df1 = df1.groupby(['state']).agg({'votes_dem' : 'sum', 'votes_gop': 'sum'}).reset_index()
df1['pol_m'] = np.where(df1['votes_dem'] > df1['votes_gop'], "Blue", "Red")

Finally, if you merge the kills-per-state dataframe with the dataframe above, you get whether a state is Blue or Red and how many fatalities from police shootings it saw over the five years.

In [513]:
by_state = pd.DataFrame()
by_state= pd.merge(df1, kill_st, on='state')
by_state= by_state.join(df3.set_index('state'), on='state')
In [514]:
by_state.head()
Out[514]:
state votes_dem votes_gop pol_m counts state_name
0 AK 2697087.0 3781977.0 Red 31 Alaska
1 AL 718084.0 1306925.0 Red 90 Alabama
2 AR 378729.0 677904.0 Red 60 Arkansas
3 AZ 936250.0 1021154.0 Red 224 Arizona
4 CA 7362490.0 3916209.0 Blue 723 California

DATASET BY STATE AND TIME

We copy the original dataframe and drop duplicate rows on year and state, because each person who died that year in that state carries the same count. We end up with a dataset that tells us how many people died per year in each state.

In [515]:
df2= df.copy()
df2 = df2.drop(['manner_of_death','date','armed','age','gender','race','city','signs_of_mental_illness','threat_level','flee','longitude','latitude','is_geocoding_exact'], axis=1)
In [516]:
# dropping duplicate values 
df2= df2.drop_duplicates(['year','state'],keep= 'last')
df2.reset_index(inplace=True)
df2.drop(['index'], axis=1, inplace=True)
In [517]:
df2= df2.sort_values(['state_name'])
In [518]:
df2.tail(5)
Out[518]:
Count_per_year state state_name year
128 1 WY Wyoming 2017
54 2 WY Wyoming 2016
265 1 WY Wyoming 2020
205 1 WY Wyoming 2019
155 3 WY Wyoming 2018

DATA VISUALIZATION

BY STATE

In [519]:
##run a for loop to make separate dataframes based on year and use Seaborn to plot them

by_state.set_index('state_name', inplace=True)
In [520]:
by_state.head()
Out[520]:
state votes_dem votes_gop pol_m counts
state_name
Alaska AK 2697087.0 3781977.0 Red 31
Alabama AL 718084.0 1306925.0 Red 90
Arkansas AR 378729.0 677904.0 Red 60
Arizona AZ 936250.0 1021154.0 Red 224
California CA 7362490.0 3916209.0 Blue 723

In this section we visualize the data by state. We use matplotlib to make a pie chart that shows how many kills happened per state over the last 5 years.

In [521]:
ax=by_state['counts'].plot(kind='pie', figsize=(20, 15))
ax.set_aspect('equal')
ax.yaxis.set_label_coords(-0.15, 0.5)
plt.show()

The pie chart was helpful, but to make things clearer and also show the mean number of kills across all states, we make a bar plot. We use the seaborn package to display the bar plot and an aggregation to find the mean.

In [522]:
mean= by_state["counts"].mean()
mean
Out[522]:
98.09803921568627
In [523]:
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 11)
plt.title("Kills by State")
fig = sns.barplot(y=by_state.index, x=by_state["counts"])
# adding a vertical line for the mean count
ax.axvline(mean, color="blue", linewidth=2)
Out[523]:
<matplotlib.lines.Line2D at 0x20d8431fd30>

We can clearly see that certain states like California, Florida and Texas have high numbers of fatalities by the police. However, we don't see a clear trend based on states alone.

We therefore plot each state by year to see if there are trends common to all states. We do this by pivoting the dataframe so that the index becomes the years and the columns become the states. After that, we plot each column by iterating through the dataframe.

In [524]:
year_rate = df2.pivot(index='year', columns='state_name', values='Count_per_year')
In [525]:
for i, col in enumerate(year_rate.columns):
    plt.figure(i)            # one figure per state
    year_rate[col].plot()
    plt.title(col)

plt.show()

Again, we do not see any clear trend in kills over the years across the states.

We provide the states with the most kills below for reference to the plots.

In [526]:
df.stb.freq(['state'], thresh=50)
Out[526]:
state count percent cumulative_count cumulative_percent
0 CA 723 14.451329 723 14.451329
1 TX 441 8.814711 1164 23.266040
2 FL 349 6.975815 1513 30.241855
3 AZ 224 4.477314 1737 34.719168
4 CO 179 3.577853 1916 38.297022
5 GA 166 3.318009 2082 41.615031
6 OH 151 3.018189 2233 44.633220
7 OK 150 2.998201 2383 47.631421
8 others 2620 52.368579 5003 100.000000

BY POLITICAL AFFILIATION

Since we couldn't see a clear trend based on individual states, we group the states by their political affiliation in the 2016 election. We take by_state and use groupby with an aggregate function to find the sum of kills across all blue states and all red states. After that, we use a bar plot to show which group of states has more police shooting fatalities.

In [527]:
by_pol= by_state.groupby(['pol_m']).agg({'counts' : 'sum'}).reset_index()
by_pol.reset_index(drop=True)
Out[527]:
pol_m counts
0 Blue 1932
1 Red 3071
In [528]:
br= by_pol.plot(kind='bar',x='pol_m',y='counts', 
        color=["blue","red"]) 
br.set_xlabel("Democrat Vs Republican")
br.set_ylabel("Fatal Police Shooting")
Out[528]:
Text(0, 0.5, 'Fatal Police Shooting')

We see that the fatalities in red states are significantly higher than in blue states. Keep in mind that in the 2016 election the presidential candidate of the Republican party was Donald Trump.

BY AGE

In [529]:
plt.figure(figsize=(12,8))
plt.title('Age Distribution of Deaths', fontsize=15)
sns.distplot(df.age)
Out[529]:
<matplotlib.axes._subplots.AxesSubplot at 0x20df5be27f0>

BY EXACT LATITUDE AND LONGITUDE

In this section we visualize the data in terms of location and the attributes of the individual data points. We use the folium package to show where shootings are concentrated in the United States and which groups seem to dominate the data.

In [530]:
import folium
from folium import plugins
from folium.plugins import HeatMap
In [531]:
locationInfo.head(3)
Out[531]:
date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee longitude latitude is_geocoding_exact state_name
0 2015-01-02 shot gun 53.0 M A Shelton WA True attack Not fleeing -123.122 47.247 True Washington
1 2015-01-02 shot gun 47.0 M W Aloha OR False attack Not fleeing -122.892 45.487 True Oregon
2 2015-01-03 shot and Tasered unarmed 23.0 M H Wichita KS False other Not fleeing -97.281 37.695 True Kansas
In [532]:
def per_race(r):
    color = ''
    if r == 'B':
        color = 'black'
    elif r == 'W':
        color = 'white'
    elif r == 'U':
        color = 'green'
    elif r == 'A':
        color = 'yellow'
    elif r == 'O':
        color = 'blue'
    elif r == 'H':
        color = 'purple'
    elif r == 'N':
        color = 'crimson'
    return color
In [533]:
map_PoliceKillings = folium.Map()
cluster = folium.plugins.MarkerCluster(name="Fatal Police Shooting").add_to(map_PoliceKillings)
In [534]:
for person in locationInfo.itertuples():
    lat = person.latitude
    long = person.longitude
    if person.gender == 'M':
        sex_c= 'blue'
    else:
        sex_c= 'pink'
        
    race_c= per_race(person.race)
        
    sex = "Sex: {} ".format(person.gender)
    age = "Age: {}".format(person.age)
    race = "Race: {} ".format(person.race)
    armed = "Armed status {} ".format(person.armed)
    
    content = sex + "\n" + age + "\n"+ race  + "\n" + armed
        
    newMarker = folium.Marker([lat,long], popup=content, icon=folium.Icon(color=sex_c,icon_color=race_c)) 
    newMarker.add_to(cluster)
        
In [535]:
map_PoliceKillings
Out[535]:
Make this Notebook Trusted to load map: File -> Trust Notebook

We see that more white Americans have been shot than any other race. This raises the question of whether our thesis was even right.

BY RACE

We asked the specific question of whether BIPOC were impacted more by fatal police shootings, yet our map showed more white people dying. We realize, however, that we forgot to account for a very integral factor in our visualization: the population of each race.

We will use the 2019 populations, which are available on the US Census website. Although the ratios changed between 2015 and 2020, it was not a dramatic change like 10–15 percent; the ratios likely stay within a margin of a few percentage points. However, you can use the exact population in each year to be more accurate.

In [536]:
df_pop = pd.DataFrame({'race':['W','B','A','H','N','O'],
'population':[0.601, 0.134, 0.059, 0.185, 0.013, 0.008]})
df_pop['population'] = df_pop['population']*328
df_pop
Out[536]:
race population
0 W 197.128
1 B 43.952
2 A 19.352
3 H 60.680
4 N 4.264
5 O 2.624
In [537]:
df_race = df[['race','year','armed']].groupby(['race','year']).count().reset_index()
df_race.rename(columns={'armed':'number_of_deaths'}, inplace=True)
df_race.head(4)
Out[537]:
race year number_of_deaths
0 A 2015 13
1 A 2016 14
2 A 2017 14
3 A 2018 18
In [538]:
df_race = pd.merge(df_race, df_pop, on='race')
df_race['deaths_per_million'] = df_race['number_of_deaths'] / df_race['population']
df_race.head()
Out[538]:
race year number_of_deaths population deaths_per_million
0 A 2015 13 19.352 0.671765
1 A 2016 14 19.352 0.723439
2 A 2017 14 19.352 0.723439
3 A 2018 18 19.352 0.930136
4 A 2019 18 19.352 0.930136
In [539]:
plt.figure(figsize=(12,8))
plt.title("Fatal Shootings by Police", fontsize=15)
sns.barplot(x='year', y='deaths_per_million', hue='race', data=df_race )
Out[539]:
<matplotlib.axes._subplots.AxesSubplot at 0x20d8abdfbe0>
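As noted above, using exact populations for each year would be more accurate. A sketch of that approach, where the numbers below are illustrative placeholders and NOT Census figures:

```python
import pandas as pd

# Hypothetical per-year populations in millions -- placeholders only,
# replace with real Census figures for actual analysis
pop_yearly = pd.DataFrame({
    "race":       ["B",  "B",  "W",   "W"],
    "year":       [2015, 2016, 2015,  2016],
    "population": [42.9, 43.1, 196.0, 196.4],
})

# Hypothetical death counts shaped like df_race
deaths = pd.DataFrame({
    "race":             ["B",  "B",  "W",  "W"],
    "year":             [2015, 2016, 2015, 2016],
    "number_of_deaths": [250,  230,  490,  460],
})

# Merging on both race and year gives each row a year-specific denominator
merged = pd.merge(deaths, pop_yearly, on=["race", "year"])
merged["deaths_per_million"] = merged["number_of_deaths"] / merged["population"]
```

The only change from the approach in the text is merging on `['race', 'year']` instead of `'race'` alone, so each year is normalized by its own population.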

Conclusion

Our analysis employed standard Python libraries to import, modify and analyze our chosen datasets in support of our central thesis. We used three datasets: fatal police shootings, results of the 2016 election by county/state, and a list of states and abbreviations. The purpose of these datasets was to draw a relation between political affiliation and police shootings. We performed several data manipulation techniques in cleaning our data for analysis. Data visualization was a useful tool for describing our findings: we used standard Python plotting libraries such as matplotlib for pie charts and the seaborn package for our bar plots. We finally analyzed the number of deaths by race and discovered, based on our bar plot, that Black people had the highest rate of death at the hands of police in America.
